The goal of this problem set is to develop some intuition about the impact of the number of nodes in the hidden layer of a neural network. We will use a few simulated examples, so that the structure of the data we are modeling is clearly understood, and we will assess how the performance of the neural network model is affected by the structure of the data and the setup of the network.
First of all, to compensate for the lack of coverage of this topic in ISLR, let's go over a couple of simple examples. We start by simulating a simple two-class dataset in a 2D predictor space, with an outcome representative of an interaction between the attributes. (Please note that for the problems you will be working on this week you will be asked below to simulate a dataset using a different model.)
# fix seed so that narrative always matches the plots:
set.seed(1234567890)
nObs <- 1000
ctrPos <- 2
# 2*nObs points in 2D, standard normal noise around the cluster centers:
xyTmp <- matrix(rnorm(4*nObs),ncol=2)
# shift each point to a randomly chosen corner at (+/-ctrPos, +/-ctrPos):
xyCtrsTmp <- matrix(sample(c(-1,1)*ctrPos,nObs*4,replace=TRUE),ncol=2)
xyTmp <- xyTmp + xyCtrsTmp
# class is determined by the product of the center signs (an XOR-like interaction):
gTmp <- paste0("class",(1+sign(apply(xyCtrsTmp,1,prod)))/2)
plot(xyTmp,col=as.numeric(factor(gTmp)),pch=as.numeric(factor(gTmp)),xlab="X1",ylab="X2")
abline(h=0)
abline(v=0)
Symbol color and shape indicate class. This XOR-like pattern is a typical example of a problem for any approach that estimates a single linear decision boundary: no single line can separate the two classes. We used similar simulated data for the random forest (week 10) problem set.
Simulate data with n=1000 observations and p=3 covariates – all random variables from standard normal distribution. Create two category class variable assigning all observations within a sphere with radius of 1.5 centered at 3D zero to one class category and all others – to the second. Please note that this dataset is entirely different from the one used in the preface – you will need to write code simulating it on your own – similar dataset in 2D was used as a motivational example at week 11 (SVM) lecture before introducing kernels and SVMs. Since you will be reusing this code in the following two problems it is probably best to turn this procedure into a function with appropriate parameters. Check that resulting class assignment splits these observations very roughly evenly between these two groups. Plot values of the resulting covariates projected at each pair of the axes indicating classes to which observations belong with symbol color and/or shape (you can use function pairs, for example). What is the smallest number of planes in 3D space that would completely enclose points from the “inner” class?
generateSphereData <- function(n, covariates, radius = 1.5, nullVars = 0) {
  # total number of columns: informative covariates plus uninformative nulls
  m <- covariates + nullVars
  rawFeatures <- matrix(rnorm(n * m), ncol = m)
  # going through as.data.frame() names the columns V1..Vm; as_tibble()
  # replaces the deprecated as.tibble()
  dataTbl <- as_tibble(as.data.frame(rawFeatures))
  # Euclidean norm computed over the informative covariates only:
  featureNorm <- apply(dataTbl[1:covariates], 1, function(x) sqrt(sum(x^2)))
  # class 0 inside the sphere, class 1 outside
  dataTbl$class <- as.numeric(featureNorm > radius)
  dataTbl
}
sphereData <- generateSphereData(n = 1000, covariates = 3, radius = 1.5)
table(sphereData$class)
##
## 0 1
## 485 515
pairs(sphereData[1:3], col = c("red", "black")[sphereData$class + 1])
Four planes is the smallest number that can completely enclose the "inner" class: a tetrahedron is the simplest closed polyhedron in 3D. More planes would approximate the spherical boundary more closely.
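As a sanity check on the roughly even split (a small addition to the original write-up): the squared Euclidean norm of three independent standard normal covariates follows a chi-squared distribution with 3 degrees of freedom, so the expected fraction of observations inside the sphere is available in closed form.

```r
# P(||X|| < 1.5) for X with three independent N(0,1) components:
# ||X||^2 follows a chi-squared distribution with df = 3
pInner <- pchisq(1.5^2, df = 3)
round(pInner, 3)  # about 0.478, close to the observed 485/1000 inner-class fraction
```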
For the dataset simulated above fit neural networks with 1 through 6 nodes in a single hidden layer (use neuralnet implementation). For each of them calculate training error (see an example in Preface where it was calculated using err.fct field in the result returned by neuralnet). Simulate another independent dataset (with n=10,000 observations to make resulting test error estimates less variable) using the same procedure as above (3D, two classes, decision boundary as a sphere of 1.5 radius) and use it to calculate test error at each number of hidden nodes. Plot training and test errors as functions of the number of nodes in the hidden layer. What does the resulting plot tell you about the interplay between model error, model complexity and problem geometry? What is the geometrical interpretation of this error behavior?
# independent test set, 10x larger for less variable error estimates:
testSphereData <- generateSphereData(n = 10000, covariates = 3, radius = 1.5)
nnError <- tibble()
for (w in 1:6) {
  fit <- neuralnet(class ~ V1 + V2 + V3, sphereData,
                   hidden = w, linear.output = TRUE, err.fct = "sse")
  # training SSE as reported by neuralnet itself:
  trainErr <- fit$result.matrix["error", 1]
  nnError <- rbind(nnError, tibble(
    hiddenWidth = w,
    error = trainErr,
    set = "train"
  ))
  # test SSE using the same err.fct on predictions for the test set:
  testErr <- sum(fit$err.fct(compute(fit, testSphereData[1:3])$net.result, testSphereData$class))
  nnError <- rbind(nnError, tibble(
    hiddenWidth = w,
    # divide by 10 to put the 10,000-observation test error on the
    # same per-1,000-observation scale as the training error
    error = testErr / 10,
    set = "test"
  ))
}
ggplot(nnError, aes(x = hiddenWidth, y = error, col = set)) + geom_point() + geom_line()
As the number of nodes in the single hidden layer increases from one, the network becomes complex enough to represent the geometry of the problem, and the training error decreases monotonically in this case. The test error, however, does not keep decreasing: beyond roughly four or five hidden neurons there are already enough separating planes to enclose the inner class, so further improvement on the training set is likely the result of overfitting.
Setup a simulation repeating the procedure described above for n=100, 200 and 500 observations in the training set, as well as adding none, 1, 2 and 5 null variables to the training and test data (and to the covariates in the formula provided to neuralnet). Draw values for null variables from standard normal distribution as well and do not use them in the assignment of the observations to the class category (e.g. x<-matrix(rnorm(600),ncol=6); cl<-as.numeric(factor(sqrt(rowSums(x[,1:3]^2))<1.5)) creates dataset with three informative and three null variables). Repeat calculation of training and test errors at least several times for each combination of sample size, number of null variables and size of the hidden layer, simulating new training and test datasets every time to assess variability in those estimates. Present resulting error rates so that the effects of sample size and fraction of null variables can be discerned and discuss their impact on the resulting model fits.
nnError <- tibble()
# repeat the fits with 0, 1, 2 and 5 uninformative (null) predictors added:
for (nv in c(0, 1, 2, 5)) {
  sphereData <- generateSphereData(n = 1000, covariates = 3, radius = 1.5, nullVars = nv)
  testSphereData <- generateSphereData(n = 10000, covariates = 3, radius = 1.5, nullVars = nv)
  for (w in 1:6) {
    # formula covers the informative covariates plus the null variables:
    form <- as.formula(paste("class ~", paste(names(sphereData[1:(3 + nv)]), collapse = " + ")))
    fit <- neuralnet(form, sphereData,
                     hidden = w, linear.output = TRUE, err.fct = "sse")
    trainErr <- fit$result.matrix["error", 1]
    nnError <- rbind(nnError, tibble(
      hiddenWidth = w,
      error = trainErr,
      set = "train",
      nullVar = nv
    ))
    testErr <- sum(fit$err.fct(compute(fit, testSphereData[1:(3 + nv)])$net.result, testSphereData$class))
    nnError <- rbind(nnError, tibble(
      hiddenWidth = w,
      # scale the 10,000-observation test error to the per-1,000 scale of the training error
      error = testErr / 10,
      set = "test",
      nullVar = nv
    ))
  }
}
ggplot(nnError, aes(x = hiddenWidth, y = error, col = factor(nullVar), group = nullVar)) + geom_line() + geom_point() + facet_grid(set ~ .)
Interestingly, the number of null variables does not appear to have much of an effect on the errors. I simulated the data only once: neuralnet repeatedly failed to converge, and it was difficult enough to get every fit to complete once, let alone the several replicates the problem asks for.
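For reference, here is a sketch of the fuller simulation the problem calls for; it was not run for this report, for the convergence reasons noted above. It loops over the requested sample sizes, null-variable counts and several replicates, wrapping neuralnet in try() so that non-converging fits are skipped rather than aborting the loop. The helper makeSphere, the value of nRep and the per-1,000 error scaling are choices made for this sketch, not part of the original code.

```r
library(neuralnet)
library(tibble)

# compact stand-in for generateSphereData() defined earlier in this write-up:
makeSphere <- function(n, nullVars = 0) {
  x <- matrix(rnorm(n * (3 + nullVars)), ncol = 3 + nullVars)
  d <- as.data.frame(x)  # columns named V1, V2, ...
  d$class <- as.numeric(sqrt(rowSums(x[, 1:3]^2)) > 1.5)
  d
}

set.seed(1234567890)
fullError <- tibble()
nRep <- 3  # "at least several times"; increase as knitting time permits
for (n in c(100, 200, 500)) {
  for (nv in c(0, 1, 2, 5)) {
    for (iRep in 1:nRep) {
      trainDat <- makeSphere(n, nv)
      testDat <- makeSphere(10000, nv)
      form <- as.formula(paste("class ~",
                               paste(names(trainDat)[1:(3 + nv)], collapse = " + ")))
      for (w in 1:6) {
        fit <- try(neuralnet(form, trainDat, hidden = w,
                             linear.output = TRUE, err.fct = "sse"), silent = TRUE)
        # skip replicates where the fit failed or did not converge:
        if (inherits(fit, "try-error") || is.null(fit$weights)) next
        testErr <- sum(fit$err.fct(compute(fit, testDat[1:(3 + nv)])$net.result,
                                   testDat$class))
        fullError <- rbind(fullError, tibble(
          n = n, nullVar = nv, rep = iRep, hiddenWidth = w,
          # scale both errors to a per-1,000-observation basis:
          train = fit$result.matrix["error", 1] * 1000 / n,
          test = testErr / 10))
      }
    }
  }
}
# e.g. facet_grid(n ~ nullVar) on the long format of fullError separates
# the sample-size and null-variable effects in a single plot
```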
Use neuralnet to model the outcome in banknote authentication dataset that we used in previous weeks and compare its test error at several sizes of hidden layer to that observed for SVM and KNN approaches.
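No code accompanied this part; a hedged sketch follows. It assumes a local copy of the UCI banknote authentication file with four comma-separated predictors (variance, skewness, curtosis, entropy) followed by a 0/1 class column; the file path, the 50/50 split and the chosen hidden-layer sizes are placeholders for this sketch, not from the original document. The resulting misclassification rates can then be set against those obtained with SVM and KNN in previous weeks.

```r
library(neuralnet)

# assumed local copy of the UCI banknote authentication data (path is a placeholder):
bnDat <- read.table("data_banknote_authentication.txt", sep = ",",
                    col.names = c("variance", "skewness", "curtosis", "entropy", "class"))
set.seed(1234567890)
inTrain <- sample(nrow(bnDat), nrow(bnDat) / 2)  # 50/50 train/test split
trainDat <- bnDat[inTrain, ]
testDat <- bnDat[-inTrain, ]
bnErr <- NULL
for (w in c(1, 2, 5, 10)) {
  fit <- neuralnet(class ~ variance + skewness + curtosis + entropy,
                   trainDat, hidden = w, linear.output = TRUE, err.fct = "sse")
  # threshold the network output at 0.5 to obtain class predictions:
  pred <- as.numeric(compute(fit, testDat[, 1:4])$net.result > 0.5)
  bnErr <- rbind(bnErr, data.frame(hiddenWidth = w,
                                   testError = mean(pred != testDat$class)))
}
bnErr  # compare these misclassification rates to the SVM / KNN results
```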
For reproducibility purposes it is always a good idea to capture the state of the environment that was used to generate the results:
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] neuralnet_1.33 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6
## [5] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2
## [9] ggplot2_3.0.0 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.18 cellranger_1.1.0 pillar_1.3.0 compiler_3.5.1
## [5] plyr_1.8.4 bindr_0.1.1 tools_3.5.1 digest_0.6.15
## [9] lubridate_1.7.4 jsonlite_1.5 evaluate_0.11 nlme_3.1-137
## [13] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.1
## [17] cli_1.0.0 rstudioapi_0.7 yaml_2.2.0 haven_1.1.2
## [21] bindrcpp_0.2.2 withr_2.1.2 xml2_1.2.0 httr_1.3.1
## [25] knitr_1.20 hms_0.4.2 rprojroot_1.3-2 tidyselect_0.2.4
## [29] glue_1.3.0 R6_2.2.2 readxl_1.1.0 rmarkdown_1.10
## [33] reshape2_1.4.3 modelr_0.1.2 magrittr_1.5 backports_1.1.2
## [37] scales_0.5.0 htmltools_0.3.6 rvest_0.3.2 assertthat_0.2.0
## [41] colorspace_1.3-2 labeling_0.3 stringi_1.2.4 lazyeval_0.2.1
## [45] munsell_0.5.0 broom_0.5.0 crayon_1.3.4
The time it took to knit this file from beginning to end is about (seconds):
proc.time() - ptStart
## user system elapsed
## 129.311 8.895 141.821